Shop Customer Data

Life cycle of Machine learning model

  • Understanding the Problem Statement
  • Data Collection
  • Data Cleaning
  • Exploratory data analysis
  • Data Pre-Processing
  • Modeling
  • Model Evaluation
  • Model Deployment

The Shop Customer Data dataset describes the customers of a hypothetical shop. The shop collects this data through membership cards, which gives a detailed picture of its customer base.

The dataset consists of 2000 records with 8 columns, each representing a specific aspect of the customer's profile. These columns include Customer ID, Gender, Age, Annual Income, Spending Score, Profession, Work Experience, and Family Size.

Analyzing this data helps businesses gain insights into customer preferences, behaviors, and purchasing habits. For example, segmentation based on age, income, or family size can reveal how these factors influence purchasing decisions.

Here's a breakdown of the key points about each column:

  • Customer ID: A unique identifier assigned to each customer for tracking purchases and behaviors.
  • Gender: Indicates the customer's gender, allowing for analysis of purchasing behavior between genders.
  • Age: Represents the customer's age in years, facilitating segmentation and identifying age-related purchasing patterns.
  • Annual Income: Reflects the customer's yearly income, enabling segmentation by income groups to identify purchasing preferences.
  • Spending Score: A score given by the shop based on the customer's spending behavior. It helps segment customers based on purchasing patterns.
  • Profession: Indicates the customer's occupation or profession, aiding in the analysis of purchasing patterns across different professions.
  • Work Experience: Represents the number of years of work experience for the customer. It helps identify purchasing preferences based on experience levels.
  • Family Size: Indicates the number of family members for the customer, allowing for analysis of purchasing patterns based on family size.

Importing Required Packages

  • Importing the Pandas, NumPy, Matplotlib, Seaborn, and Plotly libraries for various operations
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline
# Display all the columns of the dataframe
pd.set_option("display.max_columns", None)

Import CSV as dataset

  • Reading the CSV and storing it as Pandas DataFrame
In [2]:
# creating dataframe
df = pd.read_csv("Customers.csv")

# printing the shape of dataset
print(df.shape)
(2000, 8)
  • There are 2000 instances and 8 columns in the dataset
In [3]:
df.head()
Out[3]:
CustomerID Gender Age Annual Income ($) Spending Score (1-100) Profession Work Experience Family Size
0 1 Male 19 15000 39 Healthcare 1 4
1 2 Male 21 35000 81 Engineer 3 3
2 3 Female 20 86000 6 Engineer 1 1
3 4 Female 23 59000 77 Lawyer 0 2
4 5 Female 31 38000 40 Entertainment 2 6
In [4]:
# Let's look at the datatypes of different features
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              2000 non-null   int64 
 1   Gender                  2000 non-null   object
 2   Age                     2000 non-null   int64 
 3   Annual Income ($)       2000 non-null   int64 
 4   Spending Score (1-100)  2000 non-null   int64 
 5   Profession              1965 non-null   object
 6   Work Experience         2000 non-null   int64 
 7   Family Size             2000 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 125.1+ KB
In [5]:
df_num=df.select_dtypes(include=np.number)
df_cat=df.select_dtypes(include='object')
In [6]:
print("There are",len(df_num.columns),"numerical variables in the dataset:",list(df_num.columns))
print("\n")
print("There are",len(df_cat.columns),"categorical variables in the dataset:",list(df_cat.columns))
There are 6 numerical variables in the dataset: ['CustomerID', 'Age', 'Annual Income ($)', 'Spending Score (1-100)', 'Work Experience', 'Family Size']


There are 2 categorical variables in the dataset: ['Gender', 'Profession']
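
One caveat on the split above: CustomerID is numeric by dtype, but it is an identifier rather than a measurement. A minimal sketch of excluding it from the numeric feature list follows. The two-row `sample` frame and the name `feature_cols` are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame using the same column names as the dataset
sample = pd.DataFrame({
    "CustomerID": [1, 2],
    "Gender": ["Male", "Female"],
    "Age": [19, 21],
    "Annual Income ($)": [15000, 35000],
})

# Select numeric columns exactly as the notebook does
numeric_cols = sample.select_dtypes(include=np.number).columns

# CustomerID is an identifier, so keep it out of the feature list
feature_cols = [col for col in numeric_cols if col != "CustomerID"]
print(feature_cols)
```

Keeping the identifier out of summaries and models avoids treating an arbitrary row label as if it were a quantity.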

Data Cleaning

Handling missing values and other anomalies

  • We will check the missing values in the dataset
In [7]:
df.isnull().sum()
Out[7]:
CustomerID                 0
Gender                     0
Age                        0
Annual Income ($)          0
Spending Score (1-100)     0
Profession                35
Work Experience            0
Family Size                0
dtype: int64
  • Only one column, Profession, has null values
In [8]:
(df.isnull().mean())*100
Out[8]:
CustomerID                0.00
Gender                    0.00
Age                       0.00
Annual Income ($)         0.00
Spending Score (1-100)    0.00
Profession                1.75
Work Experience           0.00
Family Size               0.00
dtype: float64
  • We will look at the values present in the Profession for imputing the null values
In [9]:
df['Profession'].value_counts()
Out[9]:
Artist           612
Healthcare       339
Entertainment    234
Engineer         179
Doctor           161
Executive        153
Lawyer           142
Marketing         85
Homemaker         60
Name: Profession, dtype: int64
  • As the column has only 1.75% null values, which is quite low, we can impute the null values using the mode
In [10]:
df['Profession'].mode()[0]
Out[10]:
'Artist'
In [11]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Profession'] = df['Profession'].fillna(df['Profession'].mode()[0])
In [12]:
df.isnull().sum()
Out[12]:
CustomerID                0
Gender                    0
Age                       0
Annual Income ($)         0
Spending Score (1-100)    0
Profession                0
Work Experience           0
Family Size               0
dtype: int64

No null values in the dataset now!
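
Mode imputation is one option among several. The sketch below compares it with keeping missingness visible as its own category. The four-row `sample` frame and the "Unknown" label are illustrative assumptions, not something the original notebook uses.

```python
import pandas as pd

# Hypothetical sample with one missing profession
sample = pd.DataFrame({"Profession": ["Artist", "Artist", "Engineer", None]})

# Most frequent non-null value
mode_value = sample["Profession"].mode()[0]

# Option A: mode imputation, returning a new Series
imputed_mode = sample["Profession"].fillna(mode_value)

# Option B: keep missingness visible as a separate category (assumed label)
imputed_unknown = sample["Profession"].fillna("Unknown")

print(imputed_mode.tolist())
print(imputed_unknown.tolist())
```

Option A inflates the most common class slightly. Option B leaves the class counts untouched at the cost of an extra category.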

Check for Duplicates

In [13]:
df.duplicated().sum()
Out[13]:
0
  • There are no duplicate records present in the dataset.
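
Because CustomerID is unique per row, a full-row duplicate check cannot flag two customers whose other attributes are identical. A minimal sketch of a check that ignores the identifier is shown below. The three-row `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 differ only in CustomerID
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Male", "Female"],
    "Age": [19, 19, 20],
})

# Full-row check, as in the notebook
full_row_dupes = sample.duplicated().sum()

# Check on every column except the identifier
attribute_cols = [col for col in sample.columns if col != "CustomerID"]
attribute_dupes = sample.duplicated(subset=attribute_cols).sum()

print(full_row_dupes, attribute_dupes)
```

Whether attribute-level duplicates are errors or simply similar customers is a judgment call that depends on how the data was collected.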

Exploring the dataset

In [14]:
df.head()
Out[14]:
CustomerID Gender Age Annual Income ($) Spending Score (1-100) Profession Work Experience Family Size
0 1 Male 19 15000 39 Healthcare 1 4
1 2 Male 21 35000 81 Engineer 3 3
2 3 Female 20 86000 6 Engineer 1 1
3 4 Female 23 59000 77 Lawyer 0 2
4 5 Female 31 38000 40 Entertainment 2 6
In [15]:
# Summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
df.describe()
Out[15]:
CustomerID Age Annual Income ($) Spending Score (1-100) Work Experience Family Size
count 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000 2000.000000
mean 1000.500000 48.960000 110731.821500 50.962500 4.102500 3.768500
std 577.494589 28.429747 45739.536688 27.934661 3.922204 1.970749
min 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 500.750000 25.000000 74572.000000 28.000000 1.000000 2.000000
50% 1000.500000 48.000000 110045.000000 50.000000 3.000000 4.000000
75% 1500.250000 73.000000 149092.750000 75.000000 7.000000 5.000000
max 2000.000000 99.000000 189974.000000 100.000000 17.000000 9.000000
In [16]:
df['Gender'].value_counts()
Out[16]:
Female    1186
Male       814
Name: Gender, dtype: int64
In [17]:
# Create a pie chart to visualize the distribution of gender in the dataset
fig = px.pie(values=df['Gender'].value_counts(), names=df['Gender'].value_counts().index)

# Enhance the plot by adding a title and labels
fig.update_layout(title="Distribution of Gender in the Dataset")

# Create a bar chart to visualize the distribution of gender in the dataset 
fig2 = px.bar(y=df['Gender'].value_counts(), x=df['Gender'].value_counts().index, color=df['Gender'].value_counts().index)

# Display the plot
fig.show()
fig2.show()
  • The dataset contains noticeably more female than male customers: 1,186 women versus 814 men, or roughly 59% versus 41%.

It's important to understand that this bias could affect how well machine learning models perform when trained on this dataset. This is especially true if the dataset is used to predict results or make decisions that could be influenced by gender.
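
The size of the imbalance can be stated as percentages. The sketch below rebuilds the counts from the `value_counts()` output above rather than reading the CSV. The names `gender_counts` and `gender_share` are illustrative.

```python
import pandas as pd

# Counts copied from the df['Gender'].value_counts() output above
gender_counts = pd.Series({"Female": 1186, "Male": 814})

# Share of each gender as a percentage of all customers
gender_share = (gender_counts / gender_counts.sum() * 100).round(1)
print(gender_share)
```

On the full DataFrame, `df['Gender'].value_counts(normalize=True)` gives the same proportions directly.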

In [18]:
df['Age'].value_counts()
Out[18]:
31    31
32    30
52    30
91    29
63    28
      ..
42    12
10    12
77    12
71    12
98     9
Name: Age, Length: 100, dtype: int64
In [19]:
import plotly.express as px  # alias must match the px used below
fig = px.histogram(df, x="Age",
                   title='Histogram of Age',
                   labels={'Age':'Age'}, # can specify one label per df column
                   opacity=0.8,
                   log_y=True, # represent bars with log scale
                   color_discrete_sequence=['indianred'] # color of histogram bars
                   )
fig.show()
  • The histogram illustrates the distribution of ages in the dataset, spanning from 0 to 99. The age group with the highest frequency is 30-34, with 136 individuals, while the age band with the lowest count is 70-77, with only 77 individuals.

It is worth noting that the minimum age is listed as 0. An age of 0 is implausible for a customer who holds a membership card, so these records are most likely data-entry errors or placeholder values. Further analysis is required to decide how to handle them.
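
A first step in that analysis is to count how many rows carry a zero in columns where zero is suspicious. The sketch below uses a hypothetical five-row `sample` frame. The choice of `check_cols` is an assumption about which columns deserve the check.

```python
import pandas as pd

# Hypothetical sample containing a few zero values
sample = pd.DataFrame({
    "Age": [0, 19, 45, 0, 31],
    "Annual Income ($)": [15000, 0, 86000, 59000, 38000],
    "Spending Score (1-100)": [39, 81, 0, 77, 40],
})

# Columns where a value of 0 is treated as suspicious (assumption)
check_cols = ["Age", "Annual Income ($)", "Spending Score (1-100)"]

# Number of zeros per column
zero_counts = (sample[check_cols] == 0).sum()

# Rows with a zero in at least one of the checked columns
suspect_rows = sample[(sample[check_cols] == 0).any(axis=1)]

print(zero_counts)
print(len(suspect_rows))
```

Whether such rows should be dropped, imputed, or kept depends on what the zeros actually represent in the source system.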

In [20]:
df['Annual Income ($)'].value_counts()
Out[20]:
50000     7
9000      7
97000     6
85000     6
4000      6
         ..
111859    1
186655    1
164598    1
132951    1
110610    1
Name: Annual Income ($), Length: 1786, dtype: int64
In [21]:
# Create a histogram of the 'Age' column, and include the box plot to show the distribution
fig = px.histogram(df, x='Age', marginal='box')

# Display the plot
fig.show()
  • After examining the age distribution, we find that it is roughly even across the whole range, from 0 to 99. There is a minor peak in the 30-34 band, but the departure from a flat (uniform) distribution is small. On this visual evidence, the data does not appear to be skewed towards any particular age group.

Moreover, in the context of machine learning and deep learning, a dataset that covers all age groups helps models generalize. The even distribution of age here can therefore be an advantage when building models intended to work on new, unseen data.
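
The band counts read off the histogram can also be computed directly, which makes the evenness claim easier to check. The sketch below bins ages into five-year bands with `pd.cut`. The `ages` values are a small hypothetical sample, and the band labels follow the "30-34" style used in the commentary.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for df['Age']
ages = pd.Series([0, 3, 19, 21, 20, 23, 31, 34, 99], name="Age")

# Bin edges 0, 5, ..., 100 and labels such as "30-34"
edges = np.arange(0, 105, 5)
labels = [f"{lo}-{lo + 4}" for lo in edges[:-1]]

# right=False makes each band include its lower edge and exclude its upper edge
bands = pd.cut(ages, bins=edges, labels=labels, right=False)
band_counts = bands.value_counts().sort_index()

# Show only the bands that contain at least one customer
print(band_counts[band_counts > 0])
```

Applied to the full column, a roughly flat `band_counts` would support the reading of an even age distribution.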

In [22]:
df['Spending Score (1-100)'].value_counts()
Out[22]:
49    34
42    33
55    32
17    31
46    28
      ..
72    12
6     12
9     12
95    12
0      2
Name: Spending Score (1-100), Length: 101, dtype: int64
In [23]:
# Create a histogram of the 'Spending Score (1-100)' column, and include the box plot to show the distribution
fig = px.histogram(df, x='Spending Score (1-100)', marginal='box')

# Display the plot
fig.show()
  • The spending score histogram is fairly flat overall. The bins start at a count of around 80, rise gradually to a peak of 134 in the 45 to 49 range, and then fall away, dropping sharply to 18 in the final bin. Apart from that peak and the sparse last bin, the counts shift up and down only gradually from one bin to the next.
In [24]:
df['Profession'].value_counts()
Out[24]:
Artist           647
Healthcare       339
Entertainment    234
Engineer         179
Doctor           161
Executive        153
Lawyer           142
Marketing         85
Homemaker         60
Name: Profession, dtype: int64
In [25]:
## dataframe creation - for plotting
# create new pandas dataframe which contains all counts sorted by profession
profession_df = (
    df.groupby(["Profession"])
    .size()
    .reset_index(name="Counts")
    .sort_values(by=["Profession"])
)
In [26]:
profession_df
Out[26]:
Profession Counts
0 Artist 647
1 Doctor 161
2 Engineer 179
3 Entertainment 234
4 Executive 153
5 Healthcare 339
6 Homemaker 60
7 Lawyer 142
8 Marketing 85
In [27]:
import plotly.graph_objs as go
In [28]:
# Create labels using all unique values in the column named "Profession"
labels = profession_df["Profession"].unique()

# Use the per-profession counts as the slice sizes
values = profession_df["Counts"]

# Custom define a list of colors to be used for the pie chart
earth_colors = [
    "rgb(210,180,140)",
    "rgb(218,165,32)",
    "rgb(139,69,19)",
    "rgb(175, 51, 21)",
    "rgb(35, 36, 21)",
    "rgb(188,143,143)",
    "rgb(50, 205, 50)",
    "rgb(128, 128, 128)",
    "rgb(70, 130, 180)",
]

# Define the actual figure using the dimension: profession
# Note that a pull keyword was specified to explode pie pieces out of the center
fig = go.Figure(
    data=[
        go.Pie(
            labels=labels,
            values=values,
            # pull is given as a fraction of the pie radius
            pull=[0.08, 0.03, 0.07, 0.08, 0.02, 0.2, 0.05, 0.04, 0],
            # Iterate through earth_colors list to color individual pie pieces
            marker_colors=earth_colors,
        )
    ]
)

# Update layout to show a title
fig.update_layout(title_text="Pie chart of Profession")

# Display the figure
fig.show()
  • The pie chart above shows the distribution of professions in the dataset. Artists have the highest count of any profession, while Homemakers have the lowest. Engineers and Doctors have similar counts (179 and 161), as do Executives and Lawyers (153 and 142).
In [29]:
df['Work Experience'].value_counts()
Out[29]:
1     470
0     431
8     166
9     160
7     126
4     121
6     120
5     117
10     84
2      63
3      55
12     17
13     16
14     16
11     14
15     14
16      5
17      5
Name: Work Experience, dtype: int64
In [30]:
# Create a bar chart to visualize the distribution of work experience in the dataset 
fig = px.bar(y=df['Work Experience'].value_counts(), x=df['Work Experience'].value_counts().index, color=df['Work Experience'].value_counts().index)

# Display the plot
fig.show()
  • The dataset contains 470 individuals with 1 year of work experience, the highest count among all experience levels. Next come 431 freshers, individuals with no prior work experience. Together these two groups make up about 45% of the dataset, so a large share of customers are in the early stages of their careers.

On the other hand, only 10 individuals have 16 or more years of work experience (5 with 16 years and 5 with 17). These individuals likely hold senior positions in their respective companies, given their extensive experience.

Most of the remaining individuals fall within the middle range of 4 to 10 years of work experience. This group represents a sizable portion of the dataset and likely includes professionals at various stages of their careers.
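
The group sizes discussed above can be reproduced from the `value_counts()` output. The sketch below copies those counts into a Series. The group boundaries (0 to 1, 4 to 10, 16 and above) follow the commentary, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above (years -> customers)
experience_counts = pd.Series({
    0: 431, 1: 470, 2: 63, 3: 55, 4: 121, 5: 117, 6: 120, 7: 126, 8: 166,
    9: 160, 10: 84, 11: 14, 12: 17, 13: 16, 14: 16, 15: 14, 16: 5, 17: 5,
})

years = experience_counts.index
total = experience_counts.sum()

# Group sizes for the ranges discussed in the text
early = experience_counts[years <= 1].sum()
middle = experience_counts[(years >= 4) & (years <= 10)].sum()
senior = experience_counts[years >= 16].sum()

for name, count in [("0-1 years", early), ("4-10 years", middle), ("16+ years", senior)]:
    print(f"{name}: {count} customers ({count / total:.1%})")
```

The three groups do not cover every customer: those with 2 to 3 or 11 to 15 years of experience fall outside them.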

In [31]:
df['Family Size'].value_counts()
Out[31]:
2    361
3    311
1    299
4    289
5    258
6    243
7    234
8      4
9      1
Name: Family Size, dtype: int64
In [32]:
## dataframe creation - for plotting
# create new pandas dataframe which contains all counts sorted by family size
family_df = (
    df.groupby(["Family Size"])
    .size()
    .reset_index(name="Counts")
    .sort_values(by=["Family Size"])
)
In [33]:
family_df
Out[33]:
Family Size Counts
0 1 299
1 2 361
2 3 311
3 4 289
4 5 258
5 6 243
6 7 234
7 8 4
8 9 1
In [34]:
# Create labels using all unique values in the column named "Family"
labels = family_df["Family Size"].unique()

# Use the per-family-size counts as the slice sizes
values = family_df["Counts"]

# Custom define a list of colors to be used for the pie chart
earth_colors = [
    "rgb(210,180,140)",
    "rgb(218,165,32)",
    "rgb(139,69,19)",
    "rgb(175, 51, 21)",
    "rgb(35, 36, 21)",
    "rgb(188,143,143)",
    "rgb(50, 205, 50)",
    "rgb(128, 128, 128)",
    "rgb(70, 130, 180)",
]

# Define the actual figure using the dimension: Family
# Note that a pull keyword was specified to explode pie pieces out of the center
fig = go.Figure(
    data=[
        go.Pie(
            labels=labels,
            values=values,
            # pull is given as a fraction of the pie radius
            pull=[0.08, 0.03, 0.07, 0.08, 0.02, 0.2, 0.05, 0.04, 0],
            # Iterate through earth_colors list to color individual pie pieces
            marker_colors=earth_colors,
        )
    ]
)

# Update layout to show a title
fig.update_layout(title_text="Pie chart of Family Size")

# Display the figure
fig.show()
  • The pie chart shows that family sizes 1 through 7 are spread fairly evenly, each accounting for roughly 12% to 18% of customers. The most common family size is 2 (361 customers), which could indicate couples without children or single parents. Sizes 3, 1, and 4 follow with similar counts. Large families are rare: only four customers report a family size of 8 and only one reports 9, so these are outliers in terms of family size.
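
Rare categories such as these can be flagged programmatically instead of being read off the chart. The sketch below uses the family-size counts from the output above. The 1% cut-off (`RARE_THRESHOLD_PCT`) is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above (family size -> customers)
family_counts = pd.Series({1: 299, 2: 361, 3: 311, 4: 289, 5: 258, 6: 243, 7: 234, 8: 4, 9: 1})

# Percentage share of each family size
family_share = family_counts / family_counts.sum() * 100

# Assumed cut-off: categories below 1% of customers count as rare
RARE_THRESHOLD_PCT = 1.0
rare_sizes = family_share[family_share < RARE_THRESHOLD_PCT]

print("Most common size:", family_counts.idxmax())
print("Rare sizes:", list(rare_sizes.index))
```

Rare categories are often merged into a neighbouring group (for example "7 or more") before modeling, though the original notebook does not take that step.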
In [35]:
# Create a box plot of Age by Gender
age_gender_boxplot = px.box(df, x='Gender', y='Age', color='Gender', title='Distribution of Age by Gender')

# Display the plot
age_gender_boxplot.show()
  • The box plot offers valuable insights regarding the distribution of age based on gender. It is apparent from the plot that gender does not seem to have a substantial influence on the age distribution. Both males and females exhibit similar distributions of age, indicating no significant disparities between the genders in terms of age.
In [36]:
# Create a box plot of Annual Income by Gender
annual_income_gender_boxplot = px.box(df, x='Gender', y='Annual Income ($)', color='Gender', title='Distribution of Annual Income ($) by Gender')

# Display the plot
annual_income_gender_boxplot.show()
  • The distribution of annual income in relation to gender follows a similar pattern as the distribution of age in relation to gender. Gender does not have a notable impact on the distribution of annual income. There are no significant differences between genders in terms of annual income distribution.
In [37]:
# Create a box plot of Spending Score by Gender
spending_score_gender_boxplot = px.box(df, x='Gender', y='Spending Score (1-100)', color='Gender', title='Distribution of Spending Score (1-100) by Gender')

# Display the plot
spending_score_gender_boxplot.show()
In [38]:
# Create a box plot of Work Experience by Gender
work_experience_gender_boxplot = px.box(df, x='Gender', y='Work Experience', color='Gender', title='Distribution of Work Experience by Gender')

# Display the plot
work_experience_gender_boxplot.show()
  • Based on the exploratory data analysis conducted on the "Gender" feature, it can be inferred that there is no substantial correlation between the distribution of gender and any other feature. This indicates that gender does not play a significant role in predicting the values of other variables. This finding holds significance in the context of machine learning, as it suggests that incorporating gender as a feature in predictive models may not result in significant improvements in accuracy.
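
The visual comparison in the box plots can be backed by per-gender medians. The sketch below uses only the first four rows shown in the `df.head()` output, so the numbers it prints illustrate the mechanics and say nothing about the full dataset.

```python
import pandas as pd

# First four rows from the df.head() output (subset of columns)
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Age": [19, 21, 20, 23],
    "Spending Score (1-100)": [39, 81, 6, 77],
})

# Median of each numeric column within each gender
gender_medians = sample.groupby("Gender")[["Age", "Spending Score (1-100)"]].median()
print(gender_medians)
```

On the full DataFrame, near-identical medians across genders would support the conclusion drawn from the box plots.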
In [39]:
# Create box plot for Age versus Profession
fig = px.box(df, x='Age', y='Profession', color='Profession', title='Age Distribution across Professions')

# Display the plot
fig.show()
  • The box plot suggests a slight association between age and profession. The distribution of age is not identical across professions, which indicates some variation in age between occupations. From a machine learning perspective, age may therefore carry some signal about profession, although a difference this small is unlikely to predict it on its own.

Furthermore, the median age differs across professions. For instance, the median age for engineers tends to be higher, suggesting that a significant proportion of engineers are in their 60s. Conversely, the median for marketing professionals and homemakers sits at a younger age, in the 40s. This information can be valuable in formulating targeted marketing strategies tailored to specific age groups or professions.
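
A box plot shows the median, not the mode, and the two can differ noticeably. The sketch below computes both per profession on hypothetical ages. The `sample` values and the helper `first_mode` are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages chosen so that median and mode differ for one group
sample = pd.DataFrame({
    "Profession": ["Engineer"] * 5 + ["Marketing"] * 3,
    "Age": [30, 30, 60, 65, 70, 40, 45, 45],
})

def first_mode(values):
    """Return the smallest of the most frequent values."""
    return values.mode().iloc[0]

# Median and mode of age within each profession
age_summary = sample.groupby("Profession")["Age"].agg(median="median", mode=first_mode)
print(age_summary)
```

In this toy data the Engineer group has a median of 60 but a mode of 30, which is why statements about a "typical" age should name the statistic they rely on.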

In [40]:
import plotly.graph_objects as go

# Work on a copy so the original DataFrame is left untouched
data = df.copy()

fig = go.Figure(data=go.Scatter(
    x=data['Work Experience'],
    y=data['Age'],
    mode='markers',
    marker=dict(
        size=8,
        color=data['Age'],  # Color points based on age for added interactivity
        colorscale='Viridis',  # Choose a color scale
        showscale=True  # Display color scale
    ),
    text=data['Age'],  # Display age as hover text
    hovertemplate='Age: %{text}<br>Work Experience: %{x}',  # Customize hover text
))

fig.update_layout(
    title='Age vs Work Experience',
    xaxis_title='Work Experience',
    yaxis_title='Age',
    hovermode='closest',  # Show closest data point when hovering
)

fig.show()
  • The plot depicting the relationship between age and work experience uncovers an intriguing pattern - the increase in work experience does not appear to align with the expected trend of increasing with age. This deviation from the conventional real-world expectation suggests that the dataset might not accurately represent the true distribution found in the population. It is possible that the dataset is idealized and does not reflect the complexities and variations present in real-world scenarios. Another possibility is the presence of additional factors that influence work experience regardless of age, such as career changes or educational pursuits.
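
One way to test how realistic the age and experience values are is a simple plausibility rule: work experience should not exceed the years a person could have worked. The sketch below flags rows that break this rule. The `sample` rows and the `MIN_WORKING_AGE` of 16 are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows: two realistic, two impossible, one realistic
sample = pd.DataFrame({
    "Age": [19, 21, 5, 0, 40],
    "Work Experience": [1, 3, 8, 4, 10],
})

# Assumed earliest age at which work experience can start
MIN_WORKING_AGE = 16

# Maximum experience a person of a given age could have, never below zero
max_plausible = (sample["Age"] - MIN_WORKING_AGE).clip(lower=0)

# Flag rows whose recorded experience exceeds that maximum
implausible = sample["Work Experience"] > max_plausible
print(sample[implausible])
```

A high share of flagged rows in the real data would support the suspicion that the dataset does not reflect real-world constraints.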
In [41]:
fig = go.Figure(data=go.Scatter(
    x=df['Age'],
    y=df['Spending Score (1-100)'],
    mode='markers',
    marker=dict(
        size=8,
        color=df['Age'],  # Color points based on age for added interactivity
        colorscale='Viridis',  # Choose a color scale
        showscale=True  # Display color scale
    ),
    text=df['Age'],  # Display age as hover text
    hovertemplate='Age: %{text}<br>Spending Score: %{y}',  # Customize hover text
))

fig.update_layout(
    title='Age vs Spending Score',
    xaxis_title='Age',
    yaxis_title='Spending Score (1-100)',
    hovermode='closest',  # Show closest data point when hovering
)

fig.show()
  • No discernible pattern between Age and Spending Score emerges from the plot. One might expect younger individuals to spend more on purchases and older individuals to spend less, but this dataset shows no such relationship.

  • There could be several reasons for this lack of correlation. It's possible that factors other than age, such as income level, personal preferences, or individual circumstances, have a stronger influence on spending behavior in this dataset. Additionally, the dataset may not accurately represent the general population, as it could be limited to a specific demographic or a unique set of individuals with distinct spending habits.

In [42]:
fig = go.Figure(data=go.Scatter(
    x=df['Annual Income ($)'],
    y=df['Spending Score (1-100)'],
    mode='markers',
    marker=dict(
        size=8,
        color=df['Annual Income ($)'],  # Color points based on annual income for added interactivity
        colorscale='Viridis',  # Choose a color scale
        showscale=True  # Display color scale
    ),
    text=df['Annual Income ($)'],  # Display annual income as hover text
    hovertemplate='Annual Income: $%{text}<br>Spending Score: %{y}',  # Customize hover text
))

fig.update_layout(
    title='Annual Income vs Spending Score',
    xaxis_title='Annual Income ($)',
    yaxis_title='Spending Score (1-100)',
    hovermode='closest',  # Show closest data point when hovering
)

fig.show()
  • The scatter plot is read here as indicating a positive relationship between Annual Income ($) and Spending Score (1-100): as annual income increases, the spending score also tends to increase. A plausible explanation is that individuals with higher incomes have more disposable income and can afford a wider variety of goods and services. Because the points are densely scattered, this visual impression should be confirmed with a correlation coefficient before it is relied on.
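
A correlation coefficient gives a numeric check on what the scatter plot appears to show. The sketch below computes Pearson and Spearman coefficients for the five rows shown in `df.head()`. With so few rows the result says nothing about the full dataset, and the variable names are illustrative.

```python
import pandas as pd

INCOME = "Annual Income ($)"
SCORE = "Spending Score (1-100)"

# The five rows from the df.head() output (subset of columns)
sample = pd.DataFrame({
    INCOME: [15000, 35000, 86000, 59000, 38000],
    SCORE: [39, 81, 6, 77, 40],
})

# Pearson correlation: strength of the linear relationship
pearson = sample.corr().loc[INCOME, SCORE]

# Spearman correlation: Pearson correlation computed on the ranks
spearman = sample.rank().corr().loc[INCOME, SCORE]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Values near 0 on the full DataFrame would contradict a positive relationship, while values well above 0 would support it.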
In [43]:
# Create a box plot for annual income grouped by profession
fig = px.box(df, y='Annual Income ($)', x='Profession', color="Profession")

# Set the title of the plot
fig.update_layout(title_text='Annual Income Distribution by Profession')

# Show the plot
fig.show()
  • Upon further analysis of the relationship between annual income and profession, it is observed that income distributions for certain professions like healthcare, engineering, law, entertainment, art, executive, and medicine tend to stay relatively consistent. However, there are some noticeable changes in the income distribution for individuals in the home-making profession, where the lower income values slightly increase.

Additionally, the median income for the mentioned professions remains stable around 100K, while the median income for individuals in the home-making profession shows a slight decrease. Interestingly, the marketing profession stands out with a consistent income distribution but an upward shift in the median income.

In [44]:
fig = px.scatter(df, x='Work Experience', y='Annual Income ($)', color='Work Experience', hover_data=['Annual Income ($)'])
fig.update_layout(title='Annual Income ($) vs Work Experience', xaxis_title='Work Experience', yaxis_title='Annual Income ($)')
fig.show()
  • Upon analyzing the scatter plot, an interesting observation is made that challenges the common belief that higher years of experience should correspond to higher annual income. Contrary to expectations, the plot reveals that there is no clear relationship between years of experience and the amount of annual income earned.

  • This finding is surprising because it goes against the notion that individuals with more experience should generally earn higher salaries. In the dataset, even individuals with no prior work experience (freshers) have high annual incomes, with some earning up to 189.945K USD. On the other hand, the highest annual income for a person with 17 years of experience is 180.331K USD, which is lower than some freshers' incomes.

  • This observation suggests that other factors, such as job role, industry, education level, or negotiation skills, may have a stronger influence on annual income than years of experience alone. It highlights the complexity of the relationship between experience and income and emphasizes the importance of considering multiple factors when analyzing salary trends.
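
The comparison between freshers and the most experienced customers can be tabulated per experience level. In the sketch below, the two maximum incomes (189,945 and 180,331) are the figures quoted above. The other two rows are hypothetical fillers added so that each group has more than one value.

```python
import pandas as pd

# Two quoted maxima plus two hypothetical filler rows
sample = pd.DataFrame({
    "Work Experience": [0, 0, 17, 17],
    "Annual Income ($)": [189945, 50000, 180331, 90000],
})

# Median and maximum income at each experience level
income_by_experience = (
    sample.groupby("Work Experience")["Annual Income ($)"].agg(["median", "max"])
)
print(income_by_experience)
```

On the full DataFrame, a flat median across experience levels would confirm that experience alone explains little of the variation in income.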

In [45]:
fig = px.box(df, x='Profession', y='Work Experience', color='Gender',
             title='Work Experience by Profession and Gender',
             labels={'Work Experience': 'Years of Work Experience'})

fig.show()

Upon analyzing the box plot of work experience across different professions, several important findings emerge.

  1. Professions like healthcare, executive, doctor, and marketing show a wider range of work experience compared to other sectors. However, the lawyer and entertainment sectors have a relatively low median work experience of only one year.

  2. In contrast, the healthcare, executive, and doctor professions have median work experience ranging from one to around eight years, which is more in line with expectations for these fields. The median work experience for doctors is lower at just two years, which may indicate room for improvement.

  3. Some notable outliers are observed, such as individuals with 17 years of work experience in the lawyer and artist sectors, which is impressive.

  4. The homemaker profession stands out with a wider range of work experience, spanning from around three to nine years. This suggests that once individuals enter this profession, they tend to stay for a longer period. The median work experience for homemakers is also relatively high, with the maximum median value in the entire distribution being around seven years.

  5. Gender differences contribute to variations in the distribution. For example, in healthcare, the median work experience is lower for females and higher for males, potentially reflecting gender norms and the perception of doctors as male and nurses as female.

  6. In engineering, females have a higher median work experience compared to males. Similarly, in the doctor profession, females have one year of work experience, while males have three years, despite having a similar overall range.

  7. For the homemaker profession, men tend to start earlier, with a work experience of around two years, while women have a minimum work experience of four years. However, the median work experience is the same for both genders.

These insights highlight the variations in work experience across professions and shed light on gender differences within specific fields. Understanding these patterns can be useful for making informed decisions related to career choices, workforce planning, and identifying areas for improvement.
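
The medians read off the grouped box plot can be laid out as a profession-by-gender table. The sketch below uses `pivot_table` on hypothetical rows. The Doctor values mirror the one-year and three-year medians mentioned in point 6, and the Engineer values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; only the Doctor medians echo figures from the text
sample = pd.DataFrame({
    "Profession": ["Doctor", "Doctor", "Doctor", "Doctor", "Engineer", "Engineer"],
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "Work Experience": [1, 1, 3, 3, 5, 2],
})

# Median work experience for every profession and gender combination
median_experience = sample.pivot_table(
    index="Profession",
    columns="Gender",
    values="Work Experience",
    aggfunc="median",
)
print(median_experience)
```

A table like this makes it easier to compare genders within a profession than hovering over individual boxes.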
